Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format. Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before. (One way to skip the header is sketched after these exercises.)
Then modify the program to count the total number of words in the book, and the number of times each word is used.
Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?
Modify the program from the previous exercise to print the 20 most frequently used words in the book.
Modify the previous program to read a word list and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?
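As a starting point for the header-skipping part of the first exercise, here is one possible shape for the program. It is a rough sketch, not the book's solution: it reuses process_line, which is defined later in this chapter, and it assumes the body of the text begins after a line containing "*** START OF", which is true of many Project Gutenberg files but worth checking against the one you download.
function process_book(filename)
    hist = Dict()
    in_body = false
    for line in eachline(filename)
        if !in_body
            # Skip header lines until the Gutenberg start marker appears.
            occursin("*** START OF", line) && (in_body = true)
            continue
        end
        process_line(line, hist)
    end
    hist
end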
Given the same inputs, most computer programs generate the same outputs every time, so they are said to be deterministic. Determinism is usually a good thing, since we expect the same calculation to yield the same result. For some applications, though, we want the computer to be unpredictable. Games are an obvious example, but there are more.
Making a program truly nondeterministic turns out to be difficult, but there are ways to make it at least seem nondeterministic. One of them is to use algorithms that generate pseudorandom numbers. Pseudorandom numbers are not truly random because they are generated by a deterministic computation, but just by looking at the numbers it is all but impossible to distinguish them from random.
The function rand returns a random float between 0.0 and 1.0 (including 0.0 but not 1.0). Each time you call rand, you get the next number in a long series. To see a sample, run this loop:
In [1]:
for i in 1:10
    x = rand()
    println(x)
end
The function rand can also take an iterator or array as an argument and return a random element:
In [10]:
for i in 1:10
    x = rand(1:6)
    print(x, " ")
end
In [14]:
for i in 1:10
    x = rand(['a','b','c'])
    print(x, " ")
end
In [15]:
function process_file(filename)
    hist = Dict()
    for line in eachline(filename)
        process_line(line, hist)
    end
    hist
end
In [24]:
function process_line(line, hist)
    line = replace(line, '-' => ' ')   # treat hyphenated words as two words
    for word in split(line)
        word = filter(isletter, word)  # strip punctuation and digits
        word = lowercase(word)
        hist[word] = get!(hist, word, 0) + 1
    end
end
In [25]:
hist = process_file("../data/emma.txt")
This program reads "emma.txt", which contains the text of Emma by Jane Austen. process_file loops through the lines of the file, passing them one at a time to process_line. The histogram hist is used as an accumulator.

process_line uses the function replace to replace hyphens with spaces before using split to break the line into an array of strings. It traverses the array of words and uses filter, isletter and lowercase to remove punctuation and convert to lower case. (It is a shorthand to say that strings are “converted”; remember that strings are immutable, so a function like lowercase returns a new string.)

Finally, process_line updates the histogram by creating a new item or incrementing an existing one.
To count the total number of words in the file, we can add up the frequencies in the histogram:
In [29]:
function total_words(hist)
    sum(values(hist))
end
total_words(hist)
The number of different words is just the number of items in the dictionary:
In [30]:
function different_words(hist)
    length(hist)
end
different_words(hist)
To find the most common words, we can make an array of tuples, where each tuple contains a word and its frequency, and sort it:
In [51]:
function most_common(hist)
    t = []
    for (key, value) in hist
        push!(t, (value, key))
    end
    reverse!(sort!(t))
end
In each tuple the frequency appears first, and since Julia compares tuples element by element, the array ends up sorted by frequency; reverse! then puts the largest counts first.
Here is a loop that prints the ten most common words:
In [53]:
t = most_common(hist)
println("The most common words are: ")
for (freq, word) in t[1:10]
    println(word, "\t", freq)
end
We can wrap this loop in a function that takes the number of words to print as an optional parameter:
In [54]:
function print_most_common(hist, num=10)
    t = most_common(hist)
    println("The most common words are: ")
    for (freq, word) in t[1:num]
        println(word, "\t", freq)
    end
end
The first parameter is required; the second is optional. The default value of num is 10.
If you provide only one argument:
print_most_common(hist)
num gets the default value. If you provide two arguments:
print_most_common(hist, 20)
num gets the value of the argument instead. In other words, the optional argument overrides the default value.
If a function has both required and optional parameters, all the required parameters have to come first, followed by the optional ones.
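For example, a definition along these lines (greet is a made-up function, just for illustration) is legal, whereas putting the optional parameter first would be a syntax error:
function greet(name, greeting="Hello")
    println(greeting, ", ", name, "!")
end

greet("Emma")             # prints: Hello, Emma!
greet("Emma", "Goodbye")  # prints: Goodbye, Emma!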
Finding the words from the book that are not in the word list from "words.txt" is a problem you might recognize as set subtraction; that is, we want to find all the words from one set (the words in the book) that are not in the other (the words in the list).

subtract takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. Since we don’t really care about the values, we set them all to nothing.
In [56]:
function subtract(d1, d2)
    res = Dict()
    for key in keys(d1)
        if key ∉ keys(d2)
            res[key] = nothing
        end
    end
    res
end
To find the words in the book that are not in "words.txt", we can use process_file to build a histogram for "words.txt", and then subtract:
In [58]:
words = process_file("../data/words.txt")
diff = subtract(hist, words)
println("Words in the book that aren't in the word list:")
for word in keys(diff)
    print(word, " ")
end
Some of these words are names and possessives. Others, like "rencontre", are no longer in common use. But a few are common words that should really be in the list.
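Since the operation really is set subtraction, an alternative worth knowing about (a sketch using Julia's built-in Set type and setdiff, not the approach taken above) is:
# Build sets of words from the two histograms and subtract one from the other.
book_words = Set(keys(hist))
list_words = Set(keys(words))
diff_words = setdiff(book_words, list_words)   # words in the book only
setdiff also accepts plain iterables, so setdiff(keys(hist), keys(words)) works as well.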
To choose a random word from the histogram, the simplest algorithm is to build an array with multiple copies of each word, according to its frequency, and then choose from the array:
In [69]:
function random_word(h)
    t = []
    for (word, freq) in h
        for i in 1:freq
            push!(t, word)
        end
    end
    rand(t)
end
This algorithm works, but it is not very efficient; each time you choose a random word, it rebuilds the list, which is as big as the original book. An obvious improvement is to build the list once and then make multiple selections, but the list is still big.
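A further improvement (a sketch of the usual cumulative-sum technique; the names make_cumulative and weighted_random_word are ours) stores each word only once, next to a running total of the frequencies, and uses a binary search to find where a random integer lands:
function make_cumulative(hist)
    words = collect(keys(hist))
    cum = cumsum([hist[w] for w in words])   # running totals of frequencies
    words, cum
end

function weighted_random_word(words, cum)
    x = rand(1:cum[end])              # random integer in 1..total word count
    i = searchsortedfirst(cum, x)     # binary search: first total ≥ x
    words[i]
end
Built once, words and cum can be reused for many draws, each costing only a binary search instead of a pass over the whole book.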
When you are debugging a program, and especially if you are working on a hard bug, there are five things to try:
Reading:
Examine your code, read it back to yourself, and check that it says what you meant to say.
Running:
Experiment by making changes and running different versions. Often if you display the right thing at the right place in the program, the problem becomes obvious, but sometimes you have to build scaffolding.
Ruminating:
Take some time to think! What kind of error is it: syntax, runtime, or semantic? What information can you get from the error messages, or from the output of the program? What kind of error could cause the problem you’re seeing? What did you change last, before the problem appeared?
Rubberducking:
If you explain the problem to someone else, you sometimes find the answer before you finish asking the question. Often you don’t need the other person; you could just talk to a rubber duck. And that’s the origin of the well-known strategy called rubber duck debugging.
Retreating:
At some point, the best thing to do is back off, undoing recent changes, until you get back to a program that works and that you understand. Then you can start rebuilding.